19:45
2026-06-14
lesswrong.com
ai-safety
Why Do Naive SFT Filters For Safety Properties Fail?
Google DeepMind researchers investigate why filtering supervised fine-tuning (SFT) data fails to remove safety-relevant properties from language models, proposing a method to identify the source of th…